Abstract

Finding the best lattice model representation of a given full-atom protein structure is a hard
computational problem. Several greedy methods have been suggested, but their results are usually
biased and leave room for improvement.

In this paper we formulate and implement a Constraint Programming method to refine such lattice
structure models. We show that the approach is able to provide better quality solutions. The
prototype is implemented in COLA and is based on limited discrepancy search. Finally, some promising
extensions based on local search are discussed.



1 Introduction 

Extensive structural protein studies are computationally infeasible using full-atom protein
representations. The challenge is to reduce complexity while maintaining detail [?]. Lattice protein
models are often used to achieve this, but in general only the protein backbone or the amino acid
center of mass is represented [?, 16, ?, ?, ?]. A huge variety of lattices and energy functions have
previously been developed [5, 8, 28]; the 2D-square, 3D-cubic and 3D face-centered cubic (FCC)
lattices are the most prominent.

In order to evaluate the applicability of different lattices and to enable the transformation of real
protein structures into lattice models, a representative lattice protein structure has to be calculated.
In detail, given a full-atom protein structure, one has to find the structure representation within the
lattice model that minimizes the applied distance measure. Maňuch and Gaur have shown the
NP-completeness of this problem for backbone-only models in the 3D-cubic lattice when minimizing
coordinate root mean square deviation (cRMSD) and named it the protein chain lattice fitting (PCLF)
problem [19].

The PCLF problem has been widely studied for backbone-only models. Suggested approaches utilize
quite different methods, ranging from full enumeration [4] and greedy chain growth strategies
[17, 20, 23] to dynamic programming [10], simulated annealing [25], and the optimization of
specialized force fields [13, 27]. The most important aspects in producing lattice protein models with
a low root mean square deviation (RMSD) are the lattice coordination number and the neighborhood
vector angles [23, 24]. Lattices with intermediate coordination numbers, such as the face-centered
cubic (FCC) lattice, can produce high resolution backbone models [23] and have been used in many
protein structure studies.

Most of the PCLF methods introduced are heuristics that derive good solutions in reasonable time.
Greedy methods such as chain growth algorithms [17, 20, 23] enable low runtimes, but the fitting
quality depends on the chain growth direction and parameterization. Thus, the resulting lattice models
are biased by the method applied and have potential for refinement.

The goal of this paper is to provide evidence that greedy methods can be effectively improved
by subsequent refinement steps that increase the fitting quality. We present a formalization and a
simple working prototype. Moreover, we briefly discuss some methodologies that we expect could be
effectively employed.
2 Definitions and Preliminaries

In order to define the Constraint Programming approach we first introduce some preliminary formalisms.

Given a protein in full-atom representation of length $n$ (e.g. in Protein Data Bank (PDB) format [2]),
we denote the sequence of 3D-coordinates of its $C_\alpha$-atoms (its backbone trace) by $P = (P_1, \ldots, P_n)$.

A regular lattice $L$ is defined by a set of neighboring vectors $\vec{v} \in N_L$ of equal length
($\forall \vec{v}_i, \vec{v}_j \in N_L : |\vec{v}_i| = |\vec{v}_j|$), each with a reverse
($\forall \vec{v} \in N_L : -\vec{v} \in N_L$), such that
$L = \{ \vec{x} \mid \vec{x} = \sum_{\vec{v}_i \in N_L} d_i \cdot \vec{v}_i \wedge d_i \in \mathbb{Z}^+_0 \}$.
$|N_L|$ gives the coordination number of the lattice $L$, e.g. 6 for the 3D-cubic or 12 for the FCC
lattice. All neighboring vectors $\vec{v} \in N_L$ of the lattice $L$ used are scaled to a length of
3.8 Å, which is the mean distance between consecutive $C_\alpha$-atoms in real protein structures.
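
As an illustration of the lattice definition, the following minimal sketch builds the neighbor vector sets of the 3D-cubic and FCC lattices and scales them to 3.8 Å. The function names (`cubic_neighbors`, `fcc_neighbors`, `scale_to`) are hypothetical and not part of COLA or LatPack; the FCC vectors are taken, as is common, to be all permutations of (±1, ±1, 0).

```python
import itertools
import math

CA_DISTANCE = 3.8  # mean distance of consecutive C-alpha atoms in Angstrom (from the text)


def cubic_neighbors():
    """Return the 6 unscaled neighbor vectors of the 3D-cubic lattice."""
    vectors = []
    for axis in range(3):
        for sign in (1, -1):
            v = [0, 0, 0]
            v[axis] = sign
            vectors.append(tuple(v))
    return vectors


def fcc_neighbors():
    """Return the 12 unscaled FCC neighbor vectors: two entries of +-1, one entry 0."""
    return [v for v in itertools.product((-1, 0, 1), repeat=3)
            if sum(abs(c) for c in v) == 2]


def scale_to(vectors, length=CA_DISTANCE):
    """Scale a set of equal-length vectors so that each has the given length."""
    norm = math.sqrt(sum(c * c for c in vectors[0]))
    return [tuple(c * length / norm for c in v) for v in vectors]


if __name__ == "__main__":
    print(len(cubic_neighbors()), len(fcc_neighbors()))  # coordination numbers
```

Both sets are closed under negation, which is the "each with a reverse" requirement of the definition.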

A backbone-only lattice protein structure $M$ of length $n$ is defined by a sequence of lattice nodes
$M = (M_1, \ldots, M_n) \in L^n$ representing the backbone ($C_\alpha$) monomers of each amino acid.
A valid structure ensures backbone connectivity ($\forall_{i<n} : M_i - M_{i+1} \in N_L$) as well as
self-avoidance ($\forall_{i \neq j} : M_i \neq M_j$), i.e. it represents a self-avoiding walk (SAW) in
the underlying lattice.
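
The two validity conditions can be checked directly. The sketch below assumes unscaled integer FCC coordinates and a hypothetical helper name `is_valid_structure`; it is meant only to make the definition concrete.

```python
import itertools

# Unscaled FCC neighbor vectors: two entries of +-1 and one entry 0 (12 vectors).
FCC_NEIGHBORS = {v for v in itertools.product((-1, 0, 1), repeat=3)
                 if sum(abs(c) for c in v) == 2}


def is_valid_structure(model, neighbors=FCC_NEIGHBORS):
    """Check backbone connectivity and self-avoidance of a lattice structure.

    model     -- sequence of integer 3-tuples (lattice nodes M_1, ..., M_n)
    neighbors -- set of integer neighbor vectors N_L
    """
    # Connectivity: consecutive nodes must differ by a neighbor vector.
    connected = all(
        tuple(a - b for a, b in zip(model[i], model[i + 1])) in neighbors
        for i in range(len(model) - 1)
    )
    # Self-avoidance: no lattice node is occupied twice.
    self_avoiding = len(set(model)) == len(model)
    return connected and self_avoiding
```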

The PCLF problem is to find a lattice protein model $M$ of a given protein's backbone $P$, such that a
distance measure between $M$ and $P$ (dist($M$,$P$)) is minimized [19].

In this contribution, we tackle the PCLF refinement problem. Here, a protein backbone $P$ as well
as a first lattice model $M$ is given, e.g. derived by a greedy chain growth procedure [17, 20, 23]. The
problem is to find a lattice model $M'$, such that dist($M'$,$P$) < dist($M$,$P$), via a
relaxation/refinement of the original model $M$.

In the following, we utilize distance RMSD (dRMSD, Eq. 1) as the distance measure dist($M$,$P$).
dRMSD is independent of the relative orientation of $M$ and $P$ since it captures the model's deviation
from the pairwise distances of $C_\alpha$-atoms in the original protein. Minimizing this measure
optimizes the lattice model obtained.



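$$\mathrm{dRMSD}(M,P) = \sqrt{\frac{\sum_{i<j} \left( |M_i - M_j| - |P_i - P_j| \right)^2}{n(n-1)/2}} \qquad (1)$$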
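
The dRMSD measure used as dist($M$,$P$) can be sketched in a few lines. The function name `drmsd` is an assumption for illustration; both inputs are expected as sequences of 3D coordinates in the same unit (e.g. Å).

```python
import itertools
import math


def drmsd(model, protein):
    """Distance RMSD: root mean squared difference of all pairwise distances.

    model, protein -- equally long sequences of 3D coordinates (same unit)
    """
    n = len(model)
    if n != len(protein) or n < 2:
        raise ValueError("need two structures of equal length >= 2")
    total = 0.0
    for i, j in itertools.combinations(range(n), 2):
        diff = math.dist(model[i], model[j]) - math.dist(protein[i], protein[j])
        total += diff * diff
    # n(n-1)/2 is the number of index pairs i < j
    return math.sqrt(total / (n * (n - 1) / 2))
```

Because only intra-structure distances enter the sum, translating or rotating either structure leaves the value unchanged, which is why no superposition step is needed.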

3 Refinement of Lattice Models: a Constraint Model in COLA 

In this section we formalize a Constraint Optimization Problem (COP) to solve the PCLF refinement
problem (see Sec. 2), i.e. to refine a lattice model $M$ of a protein $P$. The input is the original
protein $P$ and its lattice model $M$ to be refined. The output is a lattice model $M'$ derived from $M$
via some relaxation that optimizes our distance measure dRMSD($M'$,$P$) (Eq. 1).

We first formalize the problem and show how to implement it in COLA, a Constraint solver for
LAttices [21]. This is followed by an altered formulation that utilizes limited discrepancy search [9].
3.1 The Constraint Optimization Problem

The COP can be formalized as follows: 



$X_1 \ldots X_n$         variables representing $M' = (M'_1, \ldots, M'_n)$

$D(X_i)$                 variable domains $= \{ v \mid v \in L \wedge |v - M_i| \le f_{scale} \cdot d_{max} \}$,
                         i.e. a sphere around $M_i$ with radius $f_{scale} \cdot d_{max}$

$SAW(X_1 \ldots X_n)$    self-avoiding walk constraint, e.g. split into a chain of binary
                         contiguous constraints and a global alldifferent constraint

$O$                      objective function variable, implements dRMSD
                         $= \sum_{i<j} ( |X_j - X_i| - |P_j - P_i| )^2$, to be minimized

Note that $d_{max}$ refers to the number of lattice units used and is thus scaled to the correct
distance by $f_{scale} = 3.8$ Å. Thus, the domains for $d_{max} = 0$ only contain the original lattice
point $M_i$ (domain size 1), while $d_{max} = 1$ results in $M_i$ as well as all neighboring lattice
points (domain size 1 + 12 = 13 in FCC). The domain size governed by $d_{max}$ defines the allowed
relaxation of the original lattice model $M$ to be refined. For more details about global constraints
for protein structures on lattices, the reader can refer to [?].
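
The variable domains can be enumerated explicitly. The following sketch is not the COLA implementation; it assumes unscaled integer FCC coordinates (points with an even coordinate sum, neighbor length $\sqrt{2}$), so that the radius of $d_{max}$ lattice units corresponds to a squared integer distance of at most $2 d_{max}^2$. The name `fcc_domain` is hypothetical.

```python
import itertools


def fcc_domain(center, d_max):
    """FCC lattice points within d_max lattice units of center.

    center -- integer 3-tuple on the FCC lattice (even coordinate sum)
    d_max  -- radius in lattice units (one unit = one neighbor vector length)
    """
    reach = 2 * d_max  # generous per-axis bound on the integer offsets
    points = []
    for offset in itertools.product(range(-reach, reach + 1), repeat=3):
        on_lattice = sum(offset) % 2 == 0            # offset keeps the even coordinate sum
        in_sphere = sum(o * o for o in offset) <= 2 * d_max * d_max
        if on_lattice and in_sphere:
            points.append(tuple(c + o for c, o in zip(center, offset)))
    return points


if __name__ == "__main__":
    for d in (0, 1):
        print(d, len(fcc_domain((0, 0, 0), d)))
```

For $d_{max} = 0$ and $d_{max} = 1$ this reproduces the domain sizes 1 and 13 stated above.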

The COLA implementation takes advantage of the availability of 3D lattice point domains and
distance constraints. The implementation changes the original framework only in the input data
handling and objective function definition. A working copy of COLA and the COP implemented for this
paper are available at http://www2.unipr.it/~dalpalu/COLA/

3.2 Limited Discrepancy Search 

A simple enumeration with $d_{max} = 1$ and a protein of length 50 already shows that the search space
of the COP from the previous section is not manageable. In this example, each point can be placed in
13 different positions in the FCC lattice, and even if the contiguous constraint among the amino acids
is enforced, the number of different paths is still beyond current computational limits.
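
To give a sense of scale, the number of assignments before any constraint is applied (an upper bound only, since the contiguity and self-avoidance constraints remove many of them) is

```latex
13^{50} = 10^{50 \log_{10} 13} \approx 10^{55.7} \approx 5 \times 10^{55}
```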

We tried a simple branch and bound search on $X_1, \ldots, X_n$, where the dRMSD bound is estimated by
considering the possible placements of unlabeled variables and the best dRMSD contribution provided
by each amino acid. In detail, each amino acid $s$ not yet labeled is compared to each other amino
acid $s'$. Each pair provides a range of different contributions to the dRMSD measure, depending on the
placement of $s$ and the placement of the other amino acids (when not yet labeled). A closed-form
computation (rather than a full enumeration of all combinations), based on the bounding boxes of the
domain positions, is used to estimate the minimal contribution. Clearly, this estimate is not
particularly tight, since we relax the estimation to $\mathbb{R}^3$, where the null (best) contribution
is found as soon as the bounds on $|X_s - X_{s'}|$ include the value $|P_s - P_{s'}|$. Unfortunately,
the discrete version requires a more expensive evaluation that boils down to full pair checks.
Therefore, the current bound is very loose and the pruning effects are modest.
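
One way to read the bounding-box estimate is sketched below. This is an illustrative reconstruction under our own assumptions, not the code used in COLA: each domain is replaced by its axis-aligned bounding box, the smallest and largest possible distance between the two boxes is computed in closed form, and the pair contributes zero whenever the target distance $|P_s - P_{s'}|$ falls inside that range. The names `box_distance_range` and `pair_lower_bound` are hypothetical.

```python
import math


def box_distance_range(box_a, box_b):
    """Minimal and maximal Euclidean distance between points of two boxes.

    Each box is a pair (lo, hi) of 3-tuples with lo[k] <= hi[k].
    """
    (lo_a, hi_a), (lo_b, hi_b) = box_a, box_b
    min_sq = max_sq = 0.0
    for k in range(3):
        # Per-axis gap is zero if the intervals overlap.
        gap = max(0.0, lo_a[k] - hi_b[k], lo_b[k] - hi_a[k])
        # Per-axis largest separation is reached at opposite interval ends.
        span = max(hi_a[k] - lo_b[k], hi_b[k] - lo_a[k])
        min_sq += gap * gap
        max_sq += span * span
    return math.sqrt(min_sq), math.sqrt(max_sq)


def pair_lower_bound(box_s, box_t, target):
    """Lower bound on (|X_s - X_t| - target)^2 over the continuous relaxation."""
    d_min, d_max = box_distance_range(box_s, box_t)
    if d_min <= target <= d_max:
        return 0.0  # the relaxation can match the target distance exactly
    gap = d_min - target if target < d_min else target - d_max
    return gap * gap
```

The sketch also shows why the bound is loose: as soon as the boxes are moderately large, almost every target distance lies within the range and the estimated contribution collapses to zero.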

A general impression is that the dRMSD measure presents a pathological distribution of local min- 
ima, depending on the placement of amino acids on the lattice. In general, due to the discrete nature of 
the lattice, the modification of a single amino acid's position can drastically vary its contributions to the 
measure. 



Lattice model refinement of protein structures    Mann & Dal Palù



Protein ID    8RXN    1CKA    2FCW
length          52      57     106

Table 1: Proteins used, taken from the Protein Data Bank (PDB) [2].

These considerations led us to focus on identifying solutions that improve the dRMSD
w.r.t. $M$ rather than searching for the optimal one. In terms of approximate search, we tried to
capture the main characteristics of the COP and to design efficient and effective heuristics.

A simple idea we tested is limited discrepancy search [9]. This search compares the amino acid
placements in the lattice models $M$ and $M'$. Every time a corresponding amino acid is placed
differently in the two conformations, we say that there is a discrepancy. We set a global constraint
that limits the number of discrepancies to at most $K$. This generates conformations that are rather
similar to $M$, especially if $d_{max}$ is greater than 1. The rationale behind this heuristic is that
we expect potential conformations $M'$ to improve the dRMSD only when contained in a close
neighborhood of the structure $M$.

The count of the number of discrepancies K is implemented directly in COLA at each labeling step. 
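
The discrepancy-limited labeling can be illustrated by a small exhaustive search. The sketch below is not the COLA implementation and omits any bounding; it assumes unscaled integer FCC coordinates for the model, protein coordinates in Å, and hypothetical names (`refine`, `k_max`). Discrepancies are counted at each labeling step and a branch is abandoned as soon as the count exceeds `k_max`.

```python
import itertools
import math

SCALE = 3.8 / math.sqrt(2)  # Angstrom per integer FCC coordinate unit
FCC = {v for v in itertools.product((-1, 0, 1), repeat=3)
       if sum(abs(c) for c in v) == 2}


def drmsd(model, protein):
    """dRMSD between an integer FCC model (scaled to Angstrom) and the protein."""
    pairs = list(itertools.combinations(range(len(model)), 2))
    total = sum((SCALE * math.dist(model[i], model[j])
                 - math.dist(protein[i], protein[j])) ** 2 for i, j in pairs)
    return math.sqrt(total / len(pairs))


def domain(center, d_max):
    """FCC points within d_max lattice units of center."""
    reach = 2 * d_max
    return [tuple(c + o for c, o in zip(center, off))
            for off in itertools.product(range(-reach, reach + 1), repeat=3)
            if sum(off) % 2 == 0 and sum(o * o for o in off) <= 2 * d_max * d_max]


def refine(model, protein, d_max, k_max):
    """Best structure differing from model in at most k_max positions."""
    n = len(model)
    domains = [domain(m, d_max) for m in model]
    best_score, best_model = drmsd(model, protein), list(model)

    def label(prefix, used, discrepancies):
        nonlocal best_score, best_model
        i = len(prefix)
        if i == n:
            score = drmsd(prefix, protein)
            if score < best_score:
                best_score, best_model = score, list(prefix)
            return
        for point in domains[i]:
            count = discrepancies + (point != model[i])
            if count > k_max or point in used:  # discrepancy limit, self-avoidance
                continue
            if prefix and tuple(a - b for a, b in zip(point, prefix[-1])) not in FCC:
                continue                         # contiguity with the previous monomer
            prefix.append(point)
            used.add(point)
            label(prefix, used, count)
            prefix.pop()
            used.discard(point)

    label([], set(), 0)
    return best_score, best_model
```

With `k_max = 0` or `d_max = 0` only the input structure itself can be enumerated, so the function returns it unchanged.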

3.3 Results 

We summarize here the preliminary results from the COLA implementation of a $K$-discrepancy
search in the 3D FCC lattice.

The initial lattice models to be refined were generated using the LatFit tool from the LatPack
package [16, 17]. LatFit implements an efficient greedy dRMSD-optimizing chain growth method and was
parameterized to consider the best 100 structures from each elongation for further growth.¹

We test three proteins (Table 1) and for each of them we input the conformation $M$ obtained from
the greedy algorithm (LatFit). Table 2 reports the best dRMSD of our new model $M'$ found depending
on $d_{max}$ and the number $K$ of amino acids placed differently from the input conformation.
Furthermore, the time consumption for each parameterization is given.

Note that if either $K = 0$ or $d_{max} = 0$, only the input structure resulting from the greedy
LatFit run can be enumerated.

These results, although preliminary, offer an interesting insight into the distribution of suboptimal
solutions. It is interesting to note, e.g., that better solutions are found by allowing a rather large
local neighborhood for a few amino acids ($d_{max}$ parameter). On the other hand, it seems that few
modifications ($K$) are sufficient to alter the input conformation and obtain a better one.

In Figure 1 we exemplify the gain in model precision for the protein 8RXN. Only the relaxation of
$K = 4$ monomers enables the structural change that leads to a dRMSD drop from 1.2469 down to 1.0884,
an improvement of about 13%. A movement of fewer monomers would not enable such a drastic change.
This illustrates the potential of a local search scheme that iteratively applies a series of such
structural changes.
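
The quoted relative improvement follows directly from the two dRMSD values:

```latex
\frac{1.2469 - 1.0884}{1.2469} = \frac{0.1585}{1.2469} \approx 0.127
```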

Investigating the time consumption (Table 2), one can see that the runtime increases drastically with
$K$, which governs the search tree size. The domain sizes implied by $d_{max}$ do not show such an
immense influence.
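
A crude way to see the growth with $K$ is to count how many assignments a $K$-discrepancy search could visit if no other constraint pruned anything: choose up to $K$ positions and give each one of the alternative domain values. This is only an upper bound under our own simplifying assumptions (equal domain sizes, contiguity and self-avoidance ignored), and the function name is hypothetical.

```python
import math


def discrepancy_assignments(n, domain_size, k_max):
    """Upper bound on assignments with at most k_max discrepancies.

    n           -- chain length
    domain_size -- size of each variable domain (including the original point)
    k_max       -- maximal number of positions placed differently
    """
    return sum(math.comb(n, k) * (domain_size - 1) ** k for k in range(k_max + 1))


if __name__ == "__main__":
    # Chain length 52 (8RXN) with FCC domains of size 13 (d_max = 1)
    for k in range(5):
        print(k, discrepancy_assignments(52, 13, k))
```

Each additional allowed discrepancy multiplies the bound by roughly the chain length times the number of alternative placements, which matches the steep runtime increase along $K$.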

The behavior encountered is an indicator that a search based on exploring only the neighborhood 
should provide efficient and good suboptimal solutions. In the next section we briefly discuss some 
promising approaches that we plan to investigate. 



¹ For details on the LatFit method see [17] and the freely available web interface at
http://cpsp.informatik.uni-freiburg.de
Figure 1: The initial lattice model $M$ (red) of the protein chain $P$ (blue, balls) and the
final/refined lattice model $M'$ (green) resulting from $d_{max} = 2$ and $K = 4$ for protein 8RXN.
Note that only the altered loop region (residues 2-14) is shown, but the whole structure models $M$ and
$M'$ were superimposed onto $P$ independently.



                            dRMSD                              time in seconds
Protein  d_max    K=1     K=2     K=3     K=4         K=1     K=2     K=3     K=4

8RXN       0     1.2469  1.2469  1.2469  1.2469      0.048   0.081   0.040   0.039
           1     1.2319  1.2172  1.1639  1.1189      0.112   0.790   2.365   20.70
           2     1.2319  1.1674  1.1596  1.0884      0.068   0.983   6.500   106.6
           3     1.2319  1.1674  1.1596  1.0884      0.106   0.499   7.399   124.0

1CKA       0     1.2370  1.2370  1.2370  1.2370      0.031   0.030   0.027   0.037
           1     1.2226  1.2226  1.2226  1.2226      0.402   0.615   3.442   39.27
           2     1.2026  1.1887  1.1887  1.1887      0.225   0.456   7.595   120.6
           3     1.2026  1.1887  1.1887  1.1887      0.421   0.616   8.573   140.2

2FCW       0     1.1353  1.1353  1.1353  1.1353      0.043   0.050   0.058   0.078
           1     1.1353  1.1324  1.1317  1.1309      0.118   1.997   49.99   1128
           2     1.1321  1.1300  1.1254  1.1200      0.294   7.192   341.8   14235
           3     1.1321  1.1300  1.1254  1.1200      0.332   8.129   394.5   16140

Table 2: Influence of d_max and K on the discrepancy search, measured in dRMSD and time (seconds).
3.4 Future work

In our opinion, a framework that integrates CP and Local Search is particularly suited to quickly
generating suboptimal solutions that are potentially very close to the optimal one. We identify some
possible directions that we believe are excellent candidates to model and approximately solve the PCLF
problem:

• local neighboring search [3, ?]: this technique allows the integration of Gecode and Local Search
frameworks. The framework handles constraint specifications and local moves within the C++
programming language;

• k-local moves [25]: the idea here is to apply structural changes to k consecutive amino acids and
repeat the process in a Monte-Carlo and/or simulated annealing style;

• side chain model [18]: our model can be extended to include side chains, and we could exploit a
similar set of local moves;

• the framework presented in [30]: here, COLA is extended and combined directly with a Local
Search approach based on pull moves [?].

4 Conclusion 

In this paper we presented a Constraint Programming based model for the refinement of lattice fitting of 
protein conformations. A simple branching was shown to be ineffective and a limited discrepancy search 
was modeled and shown to be beneficial to the identification of suboptimal solutions. A prototypical 
implementation in the framework COLA and some preliminary results have shown the feasibility of the 
method. We believe that an extension of the framework to Local Search is particularly suited for the 
PCLF problem at hand. 

Acknowledgments This work is partially supported by PRIN08 Innovative multi-disciplinary approaches 
for constraint and preference reasoning and GNCS-INdAM Tecniche innovative per la programmazione 
con vincoli in applicazioni strategiche. 